ABSTRACT

For speaker-independent speech recognition, speaker variation is one
of the major error sources. In this paper, a speaker-independent
normalization network is constructed such that speaker variation effects
can be minimized. To achieve this goal, multiple speaker clusters
are constructed from the speaker-independent training database. A
codeword-dependent neural network is associated with each speaker
cluster. The cluster that contains the largest number of speakers
is designated as the golden cluster. The objective function is to
minimize distortions between acoustic data in each cluster and the
golden speaker cluster. Performance evaluation showed that speaker-normalized
front-end reduced the error rate by 15% for the DARPA
resource management speaker-independent speech recognition task.

1.  INTRODUCTION 

For speaker-independent speech recognition, speaker variation
is one of the major error sources. As a typical example,
the error rate of a well-trained speaker-dependent
speech recognition system is one third that of
a speaker-independent speech recognition system [11]. To
minimize speaker variation effects, we can use either
speaker-clustered models [28, 11] or speaker normalization techniques
[2, 24, 3, 25, 7]. Speaker normalization is interesting since its
application is not restricted to a specific type of speech
recognition system. In comparison with speaker normalization
techniques, speaker-clustered models not only fragment the
data, but also increase the computational complexity
substantially, since multiple models have to be maintained and
compared during recognition.

Recently, nonlinear mapping based on neural networks has
attracted considerable attention because of the ability of these
networks to optimally adjust the parameters from the training
data to approximate the nonlinear relationship between
two observed spaces (see [22, 23] for a review), although much
remains to be clarified regarding practical applications.
Nonlinear mapping of two different observation spaces is of great
interest for both theoretical and practical purposes. In the area
of speech processing, nonlinear mapping has been applied to
noise enhancement [1, 32], articulatory motion estimation
[29, 18], and speech recognition [16]. Neural networks have
been used successfully to transform data of a new speaker to
a reference speaker for speaker-adaptive speech recognition
[11]. In this paper, we will study how neural networks can be
employed to minimize speaker variation effects for
speaker-independent speech recognition. The network is used as a
nonlinear mapping function to transform speech data between
two speaker clusters. The mapping function we used is
characterized by three important properties. First, the assembly
of mapping functions enhances overall mapping quality.
Second, multiple input vectors are used simultaneously in the
transformation. Finally, the mapping function is derived from
training data, and its quality will depend on the available
amount of training data.

We used the DARPA Resource Management (RM) task [27]
as our domain to investigate the performance of speaker
normalization. The 997-word RM task is a database query
task designed from 900 sentence templates [27]. We used
the word-pair grammar, which has a test-set perplexity of about 60.
The speaker-independent training speech database consists of
3990 training sentences from 109 speakers [26]. The test set
comprises a total of 600 sentences from 20 speakers. We
used all training sentences to create multiple speaker clusters.
A codeword-dependent neural network is associated with each
speaker cluster. The cluster that contains the largest number
of speakers is designated as the golden cluster. The objective
function is to minimize distortions between acoustic data
in each cluster and the golden speaker cluster. Performance
evaluation showed that the speaker-normalized front-end reduced
the error rate by 15% for the DARPA resource management
speaker-independent speech recognition task.

This paper is organized as follows. In Section 2, the
speech recognition system SPHINX-II is reviewed. Section 3
presents the neural network architecture. Section 4 discusses its
application to speaker-independent speech recognition. Our
findings are summarized in Section 5.

2.  REVIEW  OF  THE  SPHINX-II  SYSTEM

In comparison with the SPHINX system [20], the SPHINX-II
system [6] reduced the word error rate by more than 50%
by incorporating between-word coarticulation modeling
[13], high-order dynamics [9], and sex-dependent
shared-distribution semi-continuous hidden Markov models [9, 15].
This section will review SPHINX-II, which will be used as
our baseline system for this study [6].

2.1.  Signal  Processing

The input speech signal is sampled at 16 kHz and passed through a
pre-emphasis filter, $1 - 0.9z^{-1}$. A Hamming window with
a width of 20 msec is applied to the speech signal every 10
msec. A 32-order LPC analysis then follows to compute
the 12-order cepstral coefficients. A bilinear transformation of
the cepstral coefficients is employed to approximate the mel-scale
representation. In addition, relative power is computed
together with the cepstral coefficients. Speech features used in
SPHINX-II include (t is in units of 10 msec) LPC cepstral
coefficients; 40-msec and 80-msec differenced LPC cepstral
coefficients; second-order differenced cepstral coefficients;
and power, 40-msec differenced power, and second-order
differenced power. These features are vector quantized into four
independent codebooks by the Linde-Buzo-Gray algorithm
[21], each of which has 256 entries.
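
The first two front-end steps described above (pre-emphasis and windowing) can be sketched as follows. This is only an illustrative sketch, not the SPHINX-II implementation: the function names are hypothetical, the first sample is passed through unchanged by assumption, and the LPC, cepstral, and bilinear-transform stages are omitted.

```python
import math

SAMPLE_RATE = 16000                    # 16 kHz sampling, as stated in the text
WINDOW = SAMPLE_RATE * 20 // 1000      # 20 msec window -> 320 samples
SHIFT = SAMPLE_RATE * 10 // 1000       # 10 msec frame shift -> 160 samples

def preemphasize(signal, coeff=0.9):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is kept as-is (assumption)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Standard Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def windowed_frames(signal, window=WINDOW, shift=SHIFT):
    """Cut the signal into overlapping frames and apply the Hamming window to each."""
    w = hamming(window)
    return [[signal[start + n] * w[n] for n in range(window)]
            for start in range(0, len(signal) - window + 1, shift)]
```

With these settings, one second of speech (16000 samples) yields 99 overlapping frames of 320 samples each.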

2.2.  Training 

Training procedures are based on the forward-backward
algorithm. Word models are formed by concatenating phonetic
models; sentence models by concatenating word models.
There are two stages in training. The first stage is to generate
the shared output distribution mapping table. Forty-eight
context-independent discrete phonetic models are initially
estimated from the uniform distribution. Deleted interpolation
[17] is used to smooth the estimated parameters with the
uniform distribution. Then context-dependent models are
estimated based on the context-independent ones. There are
7549 triphone models in the DARPA RM task when both
within-word and between-word triphones are considered. To
facilitate training, one-codebook discrete models were used,
where the acoustic feature consists of the cepstral coefficients,
40-msec differenced cepstrum, power, and 40-msec differenced
power. After the 7549 discrete models are obtained, the
distribution clustering procedure [14] is applied to create
4500 distributions (senones). The second stage is to train
4-codebook models. We first estimate 48 context-independent,
four-codebook discrete models with the uniform distribution.
With these context-independent models and the senone
table, we then estimate the shared-distribution SCHMMs [9].
Because of the substantial difference between male and female
speakers, two sets of sex-dependent SCHMMs are
separately trained to enhance the performance.

To  summarize,  the  configuration  of  the  SPHINX-II  system 
has: 

•  four  codebooks  of  acoustic  features, 

•  shared-distribution  between-word  and  within-word  triphone  models,

•  sex-dependent  SCHMMs. 


2.3.  Recognition 

In recognition, a language network is pre-compiled to represent
the search space. For each input utterance, the (artificial)
sex is first determined automatically as follows [8, 31].
Assume each codeword occurs equally often, and assume codeword i
is represented by a Gaussian density function $N(x, \mu_i, \Sigma_i)$.
Then, given a segment of speech $x_1^T$, $Pr_{sex}$, the probability that
$x_1^T$ is generated from codebook-sex, is approximated by:

$$Pr_{sex} = \sum_t \sum_{i \in \eta_t} \log N(x_t, \mu_i, \Sigma_i)$$

where $\eta_t$ is a set that contains the top N codeword indices
during quantization for cepstrum data $x_t$ at time t. If $Pr_{male}$
> $Pr_{female}$, then $x_1^T$ belongs to male speakers. Otherwise, $x_1^T$
is female speech. After the sex is determined, only the models
of the determined sex are activated during recognition. This
saves both CPU time and memory. For each
input utterance, the Viterbi beam search algorithm is used to
find the optimal state sequence in the language network.
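
The sex-determination rule above can be sketched as a comparison of two codebook scores. This is a minimal sketch under stated assumptions: diagonal covariances are assumed for simplicity, the top N codewords are taken to be those with the highest log density, and all function names are hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption of this sketch)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def codebook_score(segment, codebook, top_n=2):
    """Sum, over all frames, of the top-N codeword log densities for that frame."""
    total = 0.0
    for frame in segment:
        scores = sorted((log_gaussian(frame, mean, var) for mean, var in codebook),
                        reverse=True)
        total += sum(scores[:top_n])
    return total

def determine_sex(segment, male_codebook, female_codebook, top_n=2):
    """Return the label of the codebook with the larger score."""
    pr_male = codebook_score(segment, male_codebook, top_n)
    pr_female = codebook_score(segment, female_codebook, top_n)
    return "male" if pr_male > pr_female else "female"
```

Each codebook here is a list of `(mean, variance)` pairs, one per codeword; in the system described above there would be 256 cepstral codewords per sex.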

3.  NEURAL  NETWORK  ARCHITECTURE 

3.1.  Codeword-Dependent  Neural  Networks  (CDNN)

When presented with a large amount of training data, a single
network is often unable to produce satisfactory results during
training, as each network is only suitable for a relatively
small task. To improve the mapping performance, breaking
up a large task and modular construction are usually required
[5, 7]. Because the nonlinear relationship between two
speakers is very complicated, a simple network may not be
powerful enough. One solution is to partition the mapping
spaces into smaller regions, and to construct a neural network
for each region as shown in Figure 1. As each neural network
is trained on a separate region in the acoustic space, the
complexity of the mapping required of each network is
reduced. In Figure 1, the switch can be used to select the most
likely network or top N networks based on some probability
measures of acoustic similarity [10]. Functionally, the assembly
of networks is similar to a huge neural network. However,
each network in the assembly is learned independently with
training data for the corresponding regions. This reduces
the complexity of finding a good solution in a huge space of
possible network configurations since strong constraints are
introduced in performing complex constraint satisfaction in a
massively interconnected network.

Vector quantization (VQ) has been widely used for data
compression in speech and image processing. Here, it can be
used to partition the original acoustic space into different
prototypes (codewords). This partition can be regarded as a
procedure to perform broad-acoustic pattern classification.
The broad-acoustic patterns are automatically generated via a
self-organization procedure based on the LBG algorithm [21].
When the codeword-dependent neural network (CDNN) was
constructed from the data in the corresponding cell, it was
found that learning for the CDNN converges very quickly
in comparison with a huge neural network. The larger the
codebook, the quicker it converges. However, the size of the
codebook depends on the amount of available training data, since
the codeword-dependent structure fragments the training data. The
size of the codebook should be determined experimentally.
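
The codeword-dependent structure amounts to routing each frame to the network attached to its VQ cell. The sketch below illustrates this with a nearest-codeword switch; the squared Euclidean distance, the function names, and the use of plain callables as stand-ins for trained networks are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_codeword(frame, codebook):
    """Index of the codeword closest to the frame (the VQ cell the frame falls in)."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(frame, codebook[i]))

def partition_frames(frames, codebook):
    """Group training frames by VQ cell so each network sees only its own region."""
    cells = {i: [] for i in range(len(codebook))}
    for frame in frames:
        cells[nearest_codeword(frame, codebook)].append(frame)
    return cells

def normalize_frame(frame, codebook, networks):
    """Switch: apply the network associated with the frame's codeword."""
    return networks[nearest_codeword(frame, codebook)](frame)
```

Because `partition_frames` splits the training set across cells, a larger codebook leaves fewer frames per network, which is the fragmentation trade-off noted above.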

Speaker normalization involves acoustic data transformation
from one speaker cluster to another. In general, let $X^a =
x_1^a, x_2^a, \ldots, x_t^a$ be a sequence of observations (frames) at time 1,
2, ..., t of speaker a. Here, each observation at time k, $x_k^a$, is
a multidimensional vector, which usually characterizes some
short-time spectral features. For the sequence of speech
observations $X^a$ produced by speaker-cluster a, our goal is to find a
mapping function $\mathcal{F}(X^a)$ such that $\mathcal{F}(X^a)$ resembles the
corresponding sequence of observations produced by speakers in
the golden speaker cluster. Speaker variations include many
factors such as vocal tract, pitch, speaking speed, intensity,
and cultural differences. Unfortunately, given two different
speakers, there is no simple mapping function that can
account for all these variations. Consequently, we are mainly
concerned with spectral normalization. For each frame $x^a$,
we want to find a mapping function to transform it to $x^b$,
the corresponding phonetic realization produced by speaker
b. We believe that $x^a$ can represent the most important features
produced by the speaker. Thus, our objective function is to
minimize:

$$\sum_{\text{corresponding pairs}} \mathcal{D}(\mathcal{F}(x^a), x^b) \qquad (1)$$

where $\mathcal{D}(x, y)$ denotes a predefined distortion measure
between frames x and y, and corresponding pairs are
constructed to approximate acoustic realizations of different
speakers. Even if we are only interested in spectral
normalization, there is no analytic mapping solution. Instead, a
stochastic approach has to be used to study the nonlinear
relationship between the two observed spaces. We need to have
a set of supervision data (corresponding pairs in Equation 1)
to extract the nonlinear relationship.

It has been found that dynamic information plays an important
role in speech recognition [4, 20, 12]. As frame-to-frame
normalization makes no use of dynamic information, the architecture
of the normalization network is chosen to incorporate
multiple neighboring frames. One such architecture is
shown in Figure 2. Here, the current frame and its left and
right neighboring frames are fed to the multi-layer neural network
as inputs. The network output is a normalized frame
corresponding to the current input frame. By using multiple
input frames for the network, the important dynamic information
can be effectively used in estimating network parameters
and in normalization. In Figure 2, there are an input layer, a
hidden layer, and an output layer. Each arc k is associated with
a weight $w_k$. In the hidden and output layers, each node is
characterized by an internal offset $\theta$. The hidden node is also
characterized by a nonlinear sigmoid function. The input to
each hidden node and output node is a weighted sum of the
corresponding inputs with the offset $\theta$. Both the internal offsets
and arc weights are learned by the backpropagation algorithm
[30], which uses a gradient search to minimize the objective
function. If the dimension of the observation space is d and the
number of input frames is m, we will have $d \times m$ input units
in the normalization network. If we want to incorporate more
neighboring frames, this will definitely increase the number of
free parameters in the network. Although an increase in the
number of free parameters leads to quick convergence during
training, it may not lead to improved generalization
capability. Since the network is designed to normalize
new data from a given speaker to the reference speaker, good
generalization capability will be the most important concern.
Therefore, a compromise has to be made between generalization
capability and the number of free parameters.

Figure 2: A basic neural network architecture. (Inputs: previous frame, current frame, next frame; output: normalized frame.)
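
To make the relationship between d, m, and the number of free parameters concrete, the following sketch implements a network of this shape with one backpropagation step. It rests on several assumptions not stated in the text: a standard logistic sigmoid in the hidden layer, linear output nodes, a squared-error loss, and a fixed learning rate. The class and method names are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Standard logistic function (assumed form of the hidden nonlinearity)."""
    return 1.0 / (1.0 + math.exp(-z))

class NormalizationNet:
    """One-hidden-layer network mapping m stacked frames to one normalized frame."""

    def __init__(self, dim, num_frames, hidden, seed=0):
        rng = random.Random(seed)
        self.dim, self.hidden, self.n_in = dim, hidden, dim * num_frames

        def small():
            return rng.uniform(-0.1, 0.1)   # small random initial values

        self.w1 = [[small() for _ in range(self.n_in)] for _ in range(hidden)]
        self.b1 = [small() for _ in range(hidden)]
        self.w2 = [[small() for _ in range(hidden)] for _ in range(dim)]
        self.b2 = [small() for _ in range(dim)]

    def num_parameters(self):
        """Arc weights plus one offset per hidden and output node."""
        return self.hidden * (self.n_in + 1) + self.dim * (self.hidden + 1)

    def forward(self, window):
        """Return the flattened input, hidden activations, and (linear) outputs."""
        x = [v for frame in window for v in frame]
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return x, h, y

    def train_step(self, window, target, lr=0.05):
        """One gradient step on 0.5 * squared error; returns the loss before the update."""
        x, h, y = self.forward(window)
        err = [yk - tk for yk, tk in zip(y, target)]
        # Hidden-layer error terms are computed before any weights change.
        delta = [sum(err[k] * self.w2[k][j] for k in range(self.dim))
                 * h[j] * (1.0 - h[j]) for j in range(self.hidden)]
        for k in range(self.dim):
            for j in range(self.hidden):
                self.w2[k][j] -= lr * err[k] * h[j]
            self.b2[k] -= lr * err[k]
        for j in range(self.hidden):
            for i in range(self.n_in):
                self.w1[j][i] -= lr * delta[j] * x[i]
            self.b1[j] -= lr * delta[j]
        return 0.5 * sum(e * e for e in err)
```

For example, `NormalizationNet(dim=13, num_frames=3, hidden=20)` has 39 input units and 1073 free parameters; adding two more neighboring frames would add 520 hidden-layer weights.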

3.2.  Golden  Speaker-Cluster  Selection 

Speaker-dependent CDNNs have been used successfully for
speaker-adaptive speech recognition [7] (speaker-dependent
mapping). If we need to map multiple speakers to one golden
speaker and simply construct a speaker-independent CDNN, it
is unlikely that a single network will do the job. With the same
rationale as CDNN for speaker-adaptive speech recognition,
we can partition multiple speakers into speaker-clusters and
construct a cluster-dependent CDNN.

For speaker clustering, we first generated 48 phonetic HMMs
for each speaker in the speaker-independent training database.
Thus, for each speaker, we have a set of output distributions.
We then iteratively merge the two speaker-clusters whose merger
results in the least loss of information, and then move
elements from cluster to cluster to improve the overall quality.
The clustering procedure used here is similar to the one used
for generalized triphone clustering [19]. We can continue the
clustering process until the specified number of speaker-clusters is
obtained. The golden speaker-cluster is the one that contains the
largest number of speakers. We generated two golden clusters,
one for male and one for female speakers.
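
The merging step can be illustrated with a greedy agglomerative sketch. It makes simplifying assumptions that are not from the paper: each speaker is summarized by a single vector of output counts rather than a set of 48 HMM output distributions, "loss of information" is measured as the count-weighted entropy increase, and the step that moves elements between clusters is omitted. All names are hypothetical.

```python
import math

def entropy(counts):
    """Entropy (in nats) of the distribution obtained by normalizing the counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def merge_loss(a, b):
    """Count-weighted entropy increase caused by pooling two count vectors."""
    merged = [x + y for x, y in zip(a, b)]
    return (sum(merged) * entropy(merged)
            - sum(a) * entropy(a) - sum(b) * entropy(b))

def cluster_speakers(speaker_counts, num_clusters):
    """Greedily merge the pair of clusters with the smallest loss until num_clusters remain."""
    clusters = [([name], list(counts)) for name, counts in speaker_counts.items()]
    while len(clusters) > num_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: merge_loss(clusters[p[0]][1],
                                                   clusters[p[1]][1]))
        names = clusters[i][0] + clusters[j][0]
        counts = [x + y for x, y in zip(clusters[i][1], clusters[j][1])]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((names, counts))
    return [names for names, _ in clusters]

def golden_cluster(clusters):
    """The golden cluster is the one containing the most speakers."""
    return max(clusters, key=len)
```

In the setting above, this procedure would be run separately on the male and female speakers, yielding one golden cluster per sex.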

4.  EXPERIMENTAL  EVALUATION 

4.1.  Experiment  conditions 

Throughout this study, only the cepstral vectors are considered
for normalization. Once we have the normalized cepstral
vector, the first-order and second-order time derivatives can be
computed. We first clustered all the speakers in the training
set into male and female clusters, and then generated 10
speaker-clusters for male and 7 speaker-clusters for female.
We selected one golden speaker-cluster each for male and
female. There were 13 and 6 speakers in the male and female
golden clusters, respectively. To provide learning examples for
network learning, we first segmented all the training
utterances into triphones using Viterbi alignment and then used
the DTW algorithm to warp the data to the corresponding
triphone pairs in the golden speaker-cluster. Thus, for a given
frame of each training speaker, the desired output frame for
network learning is the golden speaker frame paired in the
DTW optimal path.
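
The pairing of training frames with golden-cluster frames can be sketched with a basic dynamic time warping routine. The local distance (squared Euclidean), the symmetric step pattern, and the function names are assumptions; the paper does not specify its DTW constraints.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_path(source, target):
    """Return the optimal warping path as a list of (source_index, target_index) pairs."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = squared_distance(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

def training_pairs(source, target):
    """(input frame, desired output frame) pairs along the optimal path."""
    return [(source[i], target[j]) for i, j in dtw_path(source, target)]
```

In the procedure described above, `source` would hold the frames of one triphone segment from a training speaker and `target` the frames of the matching triphone from the golden speaker-cluster.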

4.2.  Benchmark  Experiments 

As benchmark experiments, speaker-independent speech
recognition using SPHINX-II was first evaluated. The word
error rate we used here reflects all three types of errors and is
computed as

$$100 \times \frac{\text{substitutions} + \text{deletions} + \text{insertions}}{\text{total words} + \text{insertions}} \qquad (2)$$


The  average  error  rate  was  3.8%  for  speaker-independent 
speech  recognition. 
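
The error-rate measure can be written as a one-line function. The sketch follows the form of Equation 2 as printed, with total words plus insertions in the denominator; the function name and the counts in the usage note are hypothetical and are not results from the paper.

```python
def word_error_rate(substitutions, deletions, insertions, total_words):
    """Percentage error rate following Equation 2 as printed in the text."""
    errors = substitutions + deletions + insertions
    return 100.0 * errors / (total_words + insertions)
```

As an illustration with made-up counts, 20 substitutions, 10 deletions, and 8 insertions against 992 reference words give a rate of 3.8%.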


4.3.  Normalization  Results 


The input of the network consists of three frames from the
new speaker. Here, 12 cepstral coefficients and energy are
used together. Thus, there are 39 input units in the network
(three frames of 13 values each).
The output of the network has 13 units corresponding to the
normalized frame, which is made to approximate the frame of the
desired reference speaker. The energy output is discarded as it
is relatively unstable. The objective function for network
learning is to minimize the distortion (mean squared error) between
the network output and the desired reference speaker frame.
The network has one hidden layer with 20 hidden units. Each
hidden unit is associated with the generalized sigmoid
function, where $\alpha$, $\beta$ and $\gamma$ are predefined to be 4.0, 1.8, and
2.0 respectively. They are fixed for all the experiments
conducted here. The weights and offsets in the network were
initialized with small random values. The learning step and
momentum are controlled dynamically. Experience
indicates that 300 to 600 epochs are required to achieve
acceptable distortion. We created two golden speaker clusters,
one for male and one for female speakers. There were seven female
clusters and ten male clusters, which were chosen according
to the available amount of male/female training data. For each
speaker cluster, we built a cluster-dependent codebook (size
16). For the input speech signal, joint VQ pdfs are used to
select the top 2-5 clusters for normalization. Thus, let $\lambda_i$ denote
the probability that the acoustic vector belongs to cluster i, and $x_i$
denote the normalized vector using the ith cluster-dependent
CDNN. The normalized vector y can then be computed as

$$y = \frac{\sum_i \lambda_i x_i}{\sum_i \lambda_i} \qquad (3)$$
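
The combination of cluster-dependent outputs is a probability-weighted average over the selected clusters. A minimal sketch, assuming the cluster probabilities and the per-cluster normalized vectors have already been computed (the function names are hypothetical):

```python
def top_clusters(probabilities, n):
    """Indices of the n clusters with the highest probability."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:n]

def combine_normalized(weights, vectors):
    """Weighted average of the normalized vectors from the selected clusters."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) / total
            for k in range(dim)]
```

Dividing by the sum of the weights keeps the result a proper average even though only the top few clusters are used, so their probabilities need not sum to one.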


With the same training conditions as used in SPHINX-II,
when the speaker-normalized front-end is used, we reduced
the error rate from 3.8% to 3.3%, which represents a 15% error
reduction. The modest error reduction indicates that the mapping
quality still needs to be improved substantially.

5.  SUMMARY 

In this paper, the codeword-dependent neural network
(CDNN) was presented for speaker-independent speech
recognition. The network was used as a nonlinear mapping
function to transform speech data between speakers in each
cluster and the golden speaker cluster. Performance
evaluation showed that the speaker-normalized front-end reduced the
error rate by 15%, as shown in Figure 3, for the DARPA
resource management speaker-independent speech recognition
task. If we compare the error rates of speaker-dependent and
speaker-independent systems, this 15% error reduction is
relatively small. We believe that the quality of mapping functions
is extremely important if we want to bridge the gap between
speaker-dependent and speaker-independent systems.